ABSTRACT

It was theorized that an answer-until-correct procedure, whereby an
examinee marks responses to each multiple-choice question until feedback
indicates that the correct answer has been marked, would yield scores of
greater reliability and validity than the conventional number-right
procedure. Two papers and an application exercise for an undergraduate
educational psychology class provided criterion measures with which
validities of multiple-choice test scores derived by each procedure were
compared. Findings consistently favored the answer-until-correct method over
the number-right method in two reliability comparisons and in six validity
comparisons. Importance and applications of findings are discussed. (Author)

The purpose of this study was to measure incremental reliability and
validity resulting from a multiple-choice testing procedure whereby an
examinee continues to select answers to each question until the correct
answer is chosen.

In content areas assessed by best-answer type multiple-choice items, 
one is tempted to seek a finer discrimination among students on each question 
than that provided by the usual dichotomous scoring. Surely ability to answer
a question correctly on the second trial implies more competence than ability 
to select the answer only after three or four attempts. Yet potential dis- 
crimination among examinees who fail to answer correctly on the first trial 
is sacrificed by conventional right-wrong scoring. 

Since Pressey's (1926) early teaching-testing machine, many ingenious
mechanical, electrical and chemical methods have been devised to provide 
better discrimination per item among students and/or to enable examinees to 
continue responding in a real-to-life fashion until feedback indicates that 
they are correct. 

Gilman and Ferry (1972) recently reported a reliability increase from
.79 to .93 resulting from the use of an Answer-Until-Correct (AUC) procedure.

Their study raises two questions. First, is it reasonable to expect this
much reliability increment to occur consistently? Probably not, but their
success is luring. Second, is part of whatever reliability gain that can be
expected a function of affective characteristics? It seems reasonable to
speculate that the immediate feedback inherent in all varieties of AUC media
may adversely affect the performance of some anxious examinees who happen to
score poorly on the first few items. If it is true that internal consistency
of cognitive achievement tests is raised as a result of consistent affective
traits, then this increase in reliability is obtained only at the expense of
construct validity; such reliability is not a virtue. These considerations
emphasize the need for studies that investigate criterion-related validities
of AUC devices against criteria for which this affective consideration is
irrelevant.



Procedures 



Thirty-eight undergraduate students in an educational psychology class
provided data on (1) eleven 10-item multiple-choice quizzes, (2) a 50-item
multiple-choice cumulative final examination, (3) two papers respectively
dealing with behavioral objectives and with original examples of teaching
for transfer of learning, and (4) an 82-item true-false interpretative
exercise that relates human development, learning, and measurement to
classroom applications.

*Paper read at the annual meeting of the American Educational Research
Association, Chicago, Illinois, April, 1974

In contrast to the usual directions and scoring methods for multiple- 
choice tests, the AUC method used with the above four-option, multiple-choice 
measures involves directing the examinee to indicate his answer to each ques- 
tion by erasing a carbon shield covering a feedback message. If the selected 
answer is correct on the first trial, he has completed the question; otherwise, 
he makes another response, etc., until the feedback message signifies that
the correct answer has been selected. Scores were obtained by subtracting the 
sum of the total number of responses (erasures) made in finding the correct 
answer to every item from the total number of possible responses. 

Also, an inferred (conventional) number-right (INR) score was obtained
for each multiple-choice measure by counting the number of questions answered
correctly on the first trial.
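The two scoring rules described above can be illustrated with a minimal sketch. The function names and the sample erasure counts are hypothetical and are not taken from the study; the sketch assumes four options per item and that each item's record is simply the number of erasures made before the correct answer was uncovered.

```python
from typing import Sequence


def auc_score(erasures_per_item: Sequence[int], n_options: int = 4) -> int:
    """AUC score: total possible responses minus total erasures made.

    Each entry is the number of erasures (1..n_options) an examinee needed
    to uncover the correct answer to one item.
    """
    for e in erasures_per_item:
        if not 1 <= e <= n_options:
            raise ValueError(f"erasure count {e} outside 1..{n_options}")
    total_possible = n_options * len(erasures_per_item)
    return total_possible - sum(erasures_per_item)


def inr_score(erasures_per_item: Sequence[int]) -> int:
    """Inferred number-right: items answered correctly on the first trial."""
    return sum(1 for e in erasures_per_item if e == 1)


# Hypothetical record for one examinee on a 10-item, four-option quiz.
quiz = [1, 1, 2, 1, 4, 1, 3, 1, 1, 2]
print(auc_score(quiz))  # 40 possible responses - 17 erasures = 23
print(inr_score(quiz))  # 6 items correct on the first trial
```

Under this rule a 10-item, four-option quiz answered perfectly yields an AUC score of 30 and an INR score of 10, so the two scores are not on the same scale.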

Criterion measures were such as to render method of scoring multiple- 
choice items irrelevant to their scores. Each of the two papers was subject- 
ively evaluated on a numeric scale by the instructor. The true-false applica- 
tion exercise was administered without feedback and was scored objectively. 

The odd-even reliability coefficient, corrected for full length, was 
computed for each multiple-choice measure scored by each method. 
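As a sketch of the odd-even computation, the following assumes that each examinee's record is a list of per-item scores (for AUC scoring, for instance, the number of options minus the erasures on that item; for INR scoring, 1 or 0) and that the correction for full length is the Spearman-Brown formula. The paper does not name the correction formula, so that choice, like the function names, is an assumption.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_brown_full_length(r_half: float) -> float:
    """Step a half-test correlation up to the full-length reliability."""
    return 2 * r_half / (1 + r_half)


def odd_even_reliability(item_scores: Sequence[Sequence[float]]) -> float:
    """Odd-even split-half reliability, corrected for full length.

    item_scores[i][j] is examinee i's score on item j (items in test order).
    """
    odd_totals = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    return spearman_brown_full_length(pearson_r(odd_totals, even_totals))
```

The same function can be applied to AUC item scores and to INR item scores for the same examinees, which gives the pair of coefficients compared for each measure.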

To provide validity measures, each quiz and the final examination
scored by each method was correlated with each of the two papers and with the 
application exercise. 



Findings and Conclusions 

Table 1 summarizes the findings. The left-hand section displays means 
and standard deviations of the experimental variables. The middle section 
contains comparisons of the AUC and INR scores on odd-even reliabilities. 
The mean (computed by use of Fisher's z coefficients) reliability findings 
for the eleven quizzes are reported. In both reliability comparisons, the 
AUC procedure resulted in slightly higher internal consistency than the INR 
scoring method. 
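Averaging correlation coefficients by way of Fisher's z can be sketched as follows: each coefficient is transformed to z, the z values are averaged, and the mean is transformed back. The function name and the input values are hypothetical and are not data from the study.

```python
import math
from typing import Sequence


def mean_correlation_fisher_z(rs: Sequence[float]) -> float:
    """Mean of correlation coefficients computed through Fisher's z.

    z = atanh(r) for each coefficient; the mean z is mapped back with tanh.
    Every r must lie strictly between -1 and 1.
    """
    zs = [math.atanh(r) for r in rs]
    return math.tanh(sum(zs) / len(zs))


# Hypothetical coefficients: the z-based mean differs from the
# arithmetic mean (0.4) because the transformation is nonlinear.
print(round(mean_correlation_fisher_z([0.0, 0.8]), 4))  # 0.5
```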

The right-hand side of Table 1 reports three criterion-related valid- 
ity comparisons each for the mean of the quizzes and for the final examination. 
The comparisons shown in the six cells reveal slight to substantial
superiority of the AUC method over the INR method. The validity increments
for the mean of the quizzes are equivalent to what could be realized by
lengthening the quizzes, scored by the INR method, by approximately 10 to 25
per cent. But the validity gains for the final examination could not have
been achieved by any amount of mere lengthening. Relevance to the criterion
measures, as well as reliability, has been increased by the AUC procedures.
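The lengthening argument can be sketched with the standard classical-test-theory relation for the validity of a test lengthened by a factor k, r_k = r_xy * sqrt(k / (1 + (k - 1) * r_xx)), where r_xy is the present validity and r_xx the present reliability. The paper does not state which formula was used, so this relation and the function names are assumptions. As k grows without limit, r_k approaches r_xy / sqrt(r_xx), which is why some validity gains cannot be reached by lengthening at all.

```python
import math
from typing import Optional


def lengthened_validity(r_xy: float, r_xx: float, k: float) -> float:
    """Validity expected after lengthening a test by factor k."""
    return r_xy * math.sqrt(k / (1 + (k - 1) * r_xx))


def required_length_factor(r_xy: float, r_xx: float,
                           target: float) -> Optional[float]:
    """Factor k needed to raise validity from r_xy to target.

    Assumes positive r_xy and target and 0 < r_xx < 1. Returns None when
    the target exceeds the ceiling r_xy / sqrt(r_xx), that is, when no
    amount of lengthening would suffice.
    """
    q = (target / r_xy) ** 2
    if q * r_xx >= 1:
        return None
    return q * (1 - r_xx) / (1 - q * r_xx)
```

For illustration, reading the upper entry of each cell in Table 1 as the AUC value and the lower entry as the INR value (an assumed reading of the table layout), `required_length_factor(0.31, 0.18, 0.33)` gives a factor of roughly 1.17 for the mean quiz against the application exercise, while `required_length_factor(0.31, 0.76, 0.42)` returns `None` for the final examination.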

Collectively, these highly consistent, albeit statistically non- 
significant, findings suggest that the AUC procedure used in this study merits 
further study. With increasing availability of mechanical, chemical, and 
computerized testing devices, the economic viability of AUC methods is im- 
proving. If replications with varied test content, diverse examinees, and 
heterogeneous criterion measures consistently show superiority of AUC scoring 
over actual number right, as well as INR, scoring, then the method could 
provide a means of significantly enhancing the validity of multiple-choice 
testing. This improvement can be achieved with relative ease in situations 
(e.g., computer-managed instruction) wherein the immediate feedback provided 
by AUC procedures is desired in its own right for its instructional value. 



Table 1

Reliability and Validity Comparisons

  Multiple-Choice Measures                              Validities
                                 Odd-Even     Application
Variable            M     S.D.   Reliability    Exercise     Paper I     Paper II

Mean Quiz AUC     26.1     2.7       .25           .33          .24         .25
Mean Quiz INR      7.4     1.5       .18           .31          .22         .23

Final AUC        125.8    11.0       .77           .42      [illegible]     .19
Final INR         34.8     6.0       .76           .31      [illegible]     .15